Abstract

We study the problem of agnostically learning halfspaces, which is defined by a fixed but
unknown distribution $\mathcal{D}$ on $\mathbb{Q}^n \times \{\pm 1\}$. We define $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})$ as the least error of a halfspace
classifier for $\mathcal{D}$. A learner who can access $\mathcal{D}$ has to return a hypothesis whose error is small
compared to $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})$.

Using the recently developed method of [33] we prove hardness of learning results assuming
that random $K$-XOR formulas are hard to (strongly) refute. We show that no efficient learning
algorithm has non-trivial worst-case performance even under the guarantees that $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D}) \le \eta$
for arbitrarily small constant $\eta > 0$, and that $\mathcal{D}$ is supported in $\{\pm 1\}^n \times \{\pm 1\}$. Namely, even
under these favorable conditions, and for every $c > 0$, it is hard to return a hypothesis with
error $\le \frac{1}{2} - \frac{1}{n^c}$. In particular, no efficient algorithm can achieve a constant approximation ratio.

Under a stronger version of the assumption (where $K$ can be poly-logarithmic in $n$), we can take
$\eta = 2^{-\log^{1-\nu}(n)}$ for arbitrarily small $\nu > 0$. These results substantially improve on previously
known results [31, 38, 50, 51, 44], that only show hardness of exact learning.

1 Introduction


In the problem of agnostically learning halfspaces, a learner is given access to a distribution $\mathcal{D}$
on $\mathbb{Q}^n \times \{\pm 1\}$. The goal is to output¹ (a description of) a classifier $h : \mathbb{Q}^n \to \{\pm 1\}$ whose error,
$\mathrm{Err}_{\mathcal{D}}(h) := \Pr_{(x,y)\sim\mathcal{D}}(h(x) \ne y)$, is small compared to $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})$, the error of the best classifier
of the form $h_w(x) = \mathrm{sign}(\langle w, x\rangle)$. We say that a learning algorithm learns halfspaces if, given an
accuracy parameter $\epsilon > 0$, it outputs a classifier with error at most $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D}) + \epsilon$. The learner
is efficient if it runs in time $\mathrm{poly}(n, 1/\epsilon)$ and the output hypothesis can be evaluated in $\mathrm{poly}(n, 1/\epsilon)$
time given its description. The learner has an approximation ratio $\alpha = \alpha(n)$ if it is guaranteed to
return $h$ with $\mathrm{Err}_{\mathcal{D}}(h) \le \alpha \cdot \mathrm{Err}_{\mathrm{HALF}}(\mathcal{D}) + \epsilon$. We emphasize that we consider the general, improper,
setting where the learner has the freedom to return a hypothesis that is not a halfspace classifier.

The problem of learning halfspaces is as old as the field of machine learning, starting with the 
perceptron algorithm [56, 55], through the modern SVM [27, 61, 60, 28, 29] and AdaBoost [58, 57, 
40]. Halfspaces are widely used in practice, have been extensively studied theoretically, and in fact 
motivated much of the existing theory, both the statistical and the computational. 

Despite all that, the gap between the performance of best known algorithms and best known 
lower bounds is dramatic. Best known efficient algorithms [48] for the problem have a poor approx¬ 
imation ratio of D (n), and have performance better than trivial only when ErrHALF(D) < © (^)- 
As for lower bounds, strong NP-hardness results are known [8, 5, 41, 38] only for algorithms that
are restricted to return a halfspace classifier (a.k.a. proper algorithms). For general algorithms, no
NP-hardness results are known, yet several results [31, 38, 50, 51, 44] show that it is hard to
agnostically learn halfspaces under several (cryptographic and average-case) complexity assumptions.
However, these results are quantitatively very weak, as they only rule out exact learning (i.e., with
approximation ratio 1). For example, they do not rule out algorithms whose error is only 1.001 times
worse than that of the best halfspace classifier.

The main result of this paper is a quantitatively strong hardness result, assuming that (strongly)
refuting random $K$-XOR formulas is hard. Using the recently developed framework of the author
with Linial and Shalev-Shwartz [33], we show that for arbitrarily small constant $\eta > 0$ and every
$c > 0$, no poly(n)-time algorithm can return a hypothesis with error $\le \frac{1}{2} - \frac{1}{n^c}$, even when it is
guaranteed that $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D}) \le \eta$, and that $\mathcal{D}$ is supported in $\{\pm 1\}^n \times \{\pm 1\}$. This implies, in
particular, that there is no efficient learning algorithm with a constant approximation ratio. Under
a stronger version of the assumption (where $K$ is allowed to be poly-logarithmic in $n$), we can take
$\eta = 2^{-\log^{1-\nu}(n)}$ for arbitrarily small $\nu > 0$. This implies hardness of approximation up to a factor
of $2^{\log^{1-\nu}(n)}$. Interestingly, this is even stronger than the best known results [8, 41, 38] for proper
algorithms.

1.1 The random K-XOR assumption

Unless we face a dramatic breakthrough in complexity theory, it seems unlikely that hardness
of learning can be established on standard complexity assumptions such as $P \ne NP$ (see [6, 33]).
Indeed, all currently known lower bounds are based on cryptographic and average-case assumptions.
One type of such assumptions, on which we rely in this paper, concerns the random $K$-XOR problem.
This problem has been extensively studied, and assumptions about its intractability were used to
prove hardness of approximation results [1], establish public-key cryptography [1, 7], and
statistical-computational tradeoffs [11].

¹Throughout, we require algorithms to succeed with a constant probability (that can be standardly amplified by
repetition).

A $K$-tuple is a mapping $C : \{\pm 1\}^n \to \{\pm 1\}^K$ in which each output coordinate is a literal and
the $K$ literals correspond to $K$ different variables. The collection of $K$-tuples is denoted $\mathcal{X}_{n,K}$. A
$K$-formula is a collection $J = \{C_1, \ldots, C_m\}$ of $K$-tuples. An instance to the $K$-XOR problem is
a $K$-formula, and the goal is to find an assignment $\psi \in \{\pm 1\}^n$ that maximizes
$\mathrm{VAL}_{\psi,\mathrm{XOR}}(J) := \frac{|\{j : \mathrm{XOR}(C_j(\psi)) = 1\}|}{m}$. We define the value of $J$ as
$\mathrm{VAL}_{\mathrm{XOR}}(J) := \max_{\psi \in \{\pm 1\}^n} \mathrm{VAL}_{\psi,\mathrm{XOR}}(J)$. We
will allow $K$ to vary with $n$ (but still be fixed for every $n$). For example, we can look at the
$\lceil \log(n) \rceil$-XOR problem.
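
To make these definitions concrete, here is a minimal Python sketch that evaluates a small formula by brute force. The representation of a $K$-tuple as a tuple of (variable index, sign) pairs, the function names, and the convention that XOR over ±1 values is their product (with the value 1 meaning that the constraint is satisfied) are assumptions made for this illustration; the text does not fix them.

```python
from itertools import product
from math import prod

# A literal is a pair (variable index, sign) with sign in {+1, -1};
# a K-tuple is a tuple of K literals over distinct variables (assumed encoding).

def apply_tuple(C, psi):
    """Return C(psi): the values of the K literals under the assignment psi."""
    return tuple(sign * psi[i] for i, sign in C)

def xor(z):
    """Assumed convention: XOR of +-1 values is their product."""
    return prod(z)

def val(J, psi):
    """Fraction of constraints C in J with XOR(C(psi)) = 1."""
    return sum(xor(apply_tuple(C, psi)) == 1 for C in J) / len(J)

def val_xor(J, n):
    """Value of the formula: maximum of val over all 2^n assignments (tiny n only)."""
    return max(val(J, psi) for psi in product((-1, 1), repeat=n))
```

The brute-force maximum takes time exponential in $n$ and is only meant to pin down the definition, not to suggest an algorithm.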

We will consider the problem of distinguishing random formulas from formulas with high
value. Concretely, for $m = m(n)$, $K = K(n)$ and $\frac{1}{2} > \eta = \eta(n) > 0$, we say that the problem
$\mathrm{CSP}^{\mathrm{rand},1-\eta}_{m}(\mathrm{XOR}_K)$ is easy, if there exists an efficient randomized algorithm $\mathcal{A}$ with the following
properties. Its input is a $K$-formula $J$ with $n$ variables and $m$ constraints and its output satisfies:

• If $\mathrm{VAL}_{\mathrm{XOR}}(J) \ge 1 - \eta$, then

$\Pr_{\text{coins of } \mathcal{A}}(\mathcal{A}(J) = \text{“non-random”}) \ge \frac{3}{4}$

• If $J$ is random², then with probability $1 - o_n(1)$ over the choice of $J$,

$\Pr_{\text{coins of } \mathcal{A}}(\mathcal{A}(J) = \text{“random”}) \ge \frac{3}{4}$.

It is not hard to see that the problem gets easier as $m$ gets larger, and as $\eta$ gets smaller. For $\eta = 0$,
the problem is actually easy, as it can be solved using Gaussian elimination. However, even for
slightly larger $\eta$'s the problem seems to become hard. For example, for any constant $\eta > 0$, best
known algorithms [37, 25, 26, 11, 3] only work with $m = \Omega\left(n^{K/2}\right)$. In light of that, we put forward
the following two assumptions.

Assumption 1.1 There are constants $c > 0$ and $\frac{1}{2} > \eta > 0$ such that for every $K$ and
$m = n^{c \log(K) \sqrt{K}}$, the problem $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{m}(\mathrm{XOR}_K)$ is hard.

Assumption 1.2 There are constants c > 0 and ^ > rj > 0 such that for every s, K = log^(n) 
and m = the problem CSP™^’^“^(XORx) is hard. 
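
The remark above that the case $\eta = 0$ can be solved by Gaussian elimination can be sketched as follows. Under the assumed encoding $\psi_i = (-1)^{x_i}$, a constraint whose literals must multiply to 1 becomes a linear equation over GF(2): the sum of the $x_i$ over its variables equals the number of negated literals modulo 2. The literal representation (pairs of variable index and sign) and the function name are hypothetical choices for this example.

```python
def is_satisfiable(J):
    """Decide whether one assignment satisfies every XOR constraint in J.

    Each constraint is a tuple of (variable index, sign) literals whose signed
    product must equal 1. Rows are stored as (bitmask of variables, right-hand side).
    """
    pivots = {}  # highest set bit -> reduced row with that leading bit
    for C in J:
        mask, rhs = 0, 0
        for i, sign in C:
            mask ^= 1 << i          # variable i appears in this equation
            rhs ^= int(sign == -1)  # each negated literal flips the parity
        # Reduce the row against existing pivots, from the highest bit downwards.
        while mask:
            bit = mask.bit_length() - 1
            if bit not in pivots:
                pivots[bit] = (mask, rhs)
                break
            pmask, prhs = pivots[bit]
            mask ^= pmask
            rhs ^= prhs
        else:
            # The row reduced to 0 = rhs; it is contradictory when rhs is 1.
            if rhs:
                return False
    return True
```

This only handles perfectly satisfiable formulas; once a small fraction of the constraints may be violated, elimination no longer applies, which is the regime the assumptions address.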

We outline below some evidence for the assumptions, in addition to known algorithms' performance.

Hardness of approximation. Håstad's celebrated result [43] asserts that if $P \ne NP$, then for
every $\eta > 0$, it is hard to distinguish $K$-XOR instances with value $\ge 1 - \eta$ from instances with
value $\le \frac{1}{2} + \eta$. Since the value of a random formula is approximately $\frac{1}{2}$, we can interpret Håstad's
result as claiming that it is hard to distinguish formulas with value $1 - \eta$ from “semi-random”
$K$-XOR formulas (i.e., formulas whose value is approximately the value of a random formula).
Therefore, our assumptions can be seen as a strengthening of Håstad's result.

Hierarchies and SOS lower bounds. A family of algorithms whose performance has been
analyzed are convex relaxations [24, 59, 2] that belong to certain hierarchies of convex relaxations.
Among those hierarchies, the strongest is the Lasserre hierarchy (a.k.a. Sum of Squares).
Algorithms from this family achieve state-of-the-art results for the $K$-XOR and many similar problems.
In [59] it is shown that relaxations in the Lasserre hierarchy that work in sub-exponential time
cannot solve $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{n^{K/2-\epsilon}}(\mathrm{XOR}_K)$ for any $\eta, \epsilon > 0$.

²To be precise, the $K$-tuples are chosen uniformly, and independently from one another.

Lower bounds on statistical algorithms. Another family of algorithms whose performance
has been analyzed are the so-called statistical algorithms. Similarly to the hierarchy lower bounds, the
results in [39] imply that statistical algorithms cannot solve $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{n^{K/2-\epsilon}}(\mathrm{XOR}_K)$ for any $\eta, \epsilon > 0$.

Resolution lower bounds. The length of resolution refutations of random $K$-XOR formulas
has been extensively studied (e.g. [42, 13, 14, 19]). It is known (Theorem 2.24 in [18]) that random
formulas with $n^{K/2-\epsilon}$ constraints only have exponentially long resolution refutations. This shows
that yet another large family of algorithms (the so-called Davis-Putnam algorithms [35]) cannot
efficiently solve $\mathrm{CSP}^{\mathrm{rand},1}_{n^{K/2-\epsilon}}(\mathrm{XOR}_K)$ for any $\epsilon > 0$.

Similar assumptions. Several papers relied on similar assumptions. Alekhnovich [1] assumed 
that CSP™ ^(XORa) is hard for some r? < g for every C > 0. Applebaum, Barak and 
Wigderson [7] assumed that (XOR 3 ) is hard. Barak and Moitra [11] made the 

assumption that CSP’^'^]J‘^^~’^(XORi^) is hard for every e > 0 and V = \ ~ o(l)- Assumptions 

n"2" ^ 

on predicates different than AT-XOR were made in [32, 36]. The assumption in [36] implies that 
CSPp“'^’^~^(XOR 3 ) is hard for every C > 0 and rj > 0. The assumption in [32] implies that 
Csprand^i-77(X0R3) jg hard for every e > 0 and rj > 0. A much more general assumptions was 
made in [12]. It implies in particular that CSP^)(‘^’^~’'(XORif) is hard for every C > 0 and 77 > 0. 

1.2 Previous Results and Related work 

Upper bounds. When $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D}) = 0$, the problem of learning halfspaces can be solved
efficiently using linear programming. However, even for slightly larger error values the problem seems
to become much harder. Currently best known algorithms [48] have non trivial performance only 
when it is guaranteed that ErrHALF(l^) < This algorithm also achieves approximation ra¬ 

tio of n, which is currently the best known approximation ratio. Better guarantees are known 
under various assumptions on the marginal distribution [10, 20, 30, 44]. For example, a PTAS is 
known [30] when the marginal distribution is uniform. 

Lower bounds for general algorithms. Several hardness assumptions imply that it is hard
to agnostically learn halfspaces. Feldman, Gopalan, Khot and Ponnuswami [38] have shown this
based on the security of the Ajtai-Dwork cryptosystem. Kalai, Klivans, Mansour and Servedio
showed the same conclusion [44] based on the hardness of learning parity with noise. Daniely and
Shalev-Shwartz [31] derived the same conclusion based on the hardness of refuting random $K$-SAT
formulas. Klivans and Kothari [50] showed that, assuming that learning sparse parity is hard, it is
hard to learn halfspaces even when the marginal distribution is Gaussian. We note that all these
results only rule out exact algorithms, but say nothing about approximation algorithms. We note
however that by [9, 15], an algorithm with non trivial performance of j^-realizable distributions will 
result with quasi-polynomial time algorithm for learning constant depth circuits. 

Lower bounds for proper algorithms and hardness of approximation. When we restrict
the algorithms to return a halfspace classifier, the problem of learning halfspaces is essentially
equivalent [54] to the computational problem of minimizing disagreements. In this problem we are
given a sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \subset \mathbb{Q}^n \times \{\pm 1\}$, and the goal is to find a vector $w \in \mathbb{Q}^n$
that minimizes the fraction of pairs with $\mathrm{sign}(\langle w, x_i \rangle) \ne y_i$. The optimal fraction is called the
error of the sample. As a “standard” and basic computational problem, much is known about
it. The problem was shown to be NP-hard already in Karp's famous paper [46]. Soon after the
discovery of the PCP theorem, Arora, Babai, Stern and Sweedyk [8] have shown that, assuming that
no quasi-polynomial time algorithm can solve NP-hard problems, there is no efficient algorithm
with an approximation ratio of Later on, the corresponding maximization problem was 

considered by several authors [4, 17, 23, 41, 38], culminating with Feldman, Gopalan, Khot and 
Ponnuswami [38] who showed that assuming that no quasi-polynomial time algorithm can solve 
MV hard problems, no efficient algorithm can distinguish samples with error > 5 — from 

samples with error < 

Statistical-Queries Lower bounds. Statistical queries (SQ) algorithms [47] are a class of
learning algorithms whose interaction with $\mathcal{D}$ is done only via statistical queries. Concretely, an
SQ-algorithm can specify any function $Q : \{\pm 1\}^n \times \{\pm 1\} \to \{\pm 1\}$ and an error parameter $\Delta > 0$, and
receive a number $e$ satisfying $|\mathbb{E}_{(x,y)\sim\mathcal{D}}[Q(x,y)] - e| \le \Delta$. In this model, an algorithm is efficient
if it makes polynomially many queries with error parameters satisfying $\frac{1}{\Delta} \le \mathrm{poly}(n)$ (besides that,
the algorithm is not restricted). While this class is strictly smaller than the class of all algorithms,
most known algorithms admit an SQ version. In addition, as opposed to general algorithms,
unconditional lower bounds are known for SQ algorithms [21] for several learning problems. In particular,
it is known that it is hard to agnostically learn halfspaces using SQ-algorithms (e.g. [45]). We note
that as with previously known lower bounds for general algorithms, these results only rule out exact
algorithms.
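
The query interface described above can be mimicked by a small simulator. The sketch below is only an illustration under stated assumptions: the distribution is replaced by the empirical distribution of a finite sample, and the oracle perturbs the true expectation by uniform noise within the allowed error, which is just one of the answers the model permits (the model allows any answer within the tolerance). The names `sq_oracle`, `query` and `tol` are hypothetical.

```python
import random

def sq_oracle(sample, query, tol, rng):
    """Answer a statistical query on the empirical distribution of `sample`.

    `query` maps (x, y) to a value in {-1, +1}; the returned number differs from
    the true mean of the query by at most `tol` (up to floating-point rounding).
    """
    true_mean = sum(query(x, y) for x, y in sample) / len(sample)
    return true_mean + rng.uniform(-tol, tol)
```

An SQ-algorithm in this picture is any procedure that touches the data only through calls to such an oracle, with `tol` at least inverse-polynomial in the dimension.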

Lower bounds on concrete algorithms. A few results [16, 53, 34] show hardness of
approximation for several concrete families of algorithms (linear methods).

The methodology of [33]. In light of the great success of complexity theory in establishing 
hardness of approximation results for standard computational problems, having such dramatic gaps 
between upper and lower bounds is perhaps surprising. The reason for the discrepancy between 
learning problems and computational problems is the fact that it is unclear how to reduce NP-hard 
problems to learning problems (see [6, 33]). The main obstacle is the ability of a learning algorithm
to return a hypothesis which does not belong to the learnt class (in our case, halfspaces). Until
recently, there was only a single framework, due to Kearns and Valiant [49], to prove lower bounds
on learning problems. The framework of [49] makes it possible to show that certain cryptographic 
assumptions imply hardness of certain learning problems. As indicated above, the lower bounds 
established by this method are quite far from the performance of best known algorithms. 

In a recent paper [33] (see also [32]) we, together with Linial and Shalev-Shwartz, developed a 
new framework to prove hardness of learning based on hardness on average of CSP problems. Yet, 
in [33] we were not able to use our technique to establish hardness results that are based on a natural 
assumption on a well studied problem. Rather, we made a rather speculative hardness assumption, 
that is concerned with general CSP problems, most of which were never studied explicitly. We 
recognized it as the main weakness of our approach, and therefore the main direction for further 
research. About a year after, Allen, O’Donnell and Witmer [3] refuted the assumption of [33]. On 
the other hand we [31] were able to overcome the use of our speculative assumption, and prove 
hardness of learning of DNF formulas (and other problems) based on a natural assumption on 
the complexity of refuting random $K$-SAT instances, in the spirit of Feige's assumption [36]. The
current paper continues this line of work.

1.3 Results


Main result We say that a distribution $\mathcal{D}$ on $\mathbb{Q}^n \times \{\pm 1\}$ is $\eta$-almost realizable if $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D}) \le \eta$.
An algorithm has non-trivial performance w.r.t. a certain family of distributions if for some $c > 0$,
its output hypothesis has error $\le \frac{1}{2} - \frac{1}{n^c}$ whenever the underlying distribution belongs to the family.

Theorem 1.3

• Under assumption 1.1, for all $\eta > 0$, there is no poly(n)-time algorithm with
non-trivial performance on $\eta$-almost-realizable distributions on $\{\pm 1\}^n \times \{\pm 1\}$.

• Under assumption 1.2, for all $\nu > 0$, there is no poly(n)-time algorithm with non-trivial
performance on $2^{-\log^{1-\nu}(n)}$-almost-realizable distributions on $\{\pm 1\}^n \times \{\pm 1\}$.

These results imply in particular that under assumption 1.1 there is no efficient learning algorithm
with a constant approximation ratio, and under assumption 1.2 there is no efficient learning algorithm
with an approximation ratio of $2^{\log^{1-\nu}(n)}$. As mentioned above, this substantially improves on
previously known results that only showed hardness of exact learning.

Extension to large margin learning Large margin learning is a popular variant of halfspace
learning (e.g. [60, 61]). Here, the learning algorithm faces a somewhat easier task, as it is not
required to classify correctly examples that are very close to the separating hyperplane. Our basic
theorem can be extended to the large margin case. Concretely, we say that a distribution $\mathcal{D}$ on
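$\{\pm 1\}^n \times \{\pm 1\}$ is $\eta$-almost-realizable-with-margin if there is $w \in \mathbb{R}^n$ with $\|w\| \le \mathrm{poly}(n)$
such that $\Pr_{(x,y)\sim\mathcal{D}}(y \cdot \langle w, x \rangle < 1) \le \eta$. Theorem 1.3 can be extended to show that no efficient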

algorithm can perform better than trivial even when the distribution is almost realized with margin. 
Concretely, we have the following theorem. 

Theorem 1.4

• Under assumption 1.1, for all $\eta > 0$, there is no poly(n)-time algorithm with
non-trivial performance on distributions on $\{\pm 1\}^n \times \{\pm 1\}$ that are
$\eta$-almost-realizable-with-margin.

• Under assumption 1.2, for all $\nu > 0$, there is no poly(n)-time algorithm with non-trivial
performance on distributions on $\{\pm 1\}^n \times \{\pm 1\}$ that are
$2^{-\log^{1-\nu}(n)}$-almost-realizable-with-margin.

Statistical queries version Our proof technique can be adapted to show unconditional lower 
bounds for SQ-algorithms. Concretely, we have the following. 

Theorem 1.5 There is no efficient SQ-algorithm with non-trivial performance on distributions on
{±1}"' X {±1} that are 2“^°® -almost-realizable-with-margin. 

Again, this substantially improves on previously known results that only showed hardness of exact 
learning. 

Implications to hardness of approximation As explained in the previous section, the problem 
of proper learning is essentially equivalent to the problem of minimizing disagreements. As our 
results hold in particular for proper algorithms, we can conclude the following. 

Theorem 1.6 Under assumption 1.2, for every $\nu > 0$ and $c > 0$, no efficient algorithm can
distinguish samples with error $\ge \frac{1}{2} - n^{-c}$ from samples with error $\le 2^{-\log^{1-\nu}(n)}$.

We note that the conclusion of our theorem is stronger than the conclusions of the previously best
known lower bounds for the problem (however, we rely on a stronger assumption). In particular,
it implies that the problem is hard to approximate within a factor of $2^{\log^{1-\nu}(n)}$, implying the
conclusion [8] of Arora et al. It also strengthens the conclusion [38] of Feldman et al., improving
the completeness parameter to $2^{-\log^{1-\nu}(n)}$ and the soundness parameter
to $\frac{1}{2} - n^{-c}$. We remark that our result holds even if we assume that the input examples
are binary, and in the large margin setting.


2 Main proof ideas 

We next elaborate on the main ideas of our main theorem. The full proof is deferred to the 
appendix. 

2.1 The methodology of [33] 

We first describe the methodology of [33] to prove hardness of learning. A sample is a collection
$S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \subset \mathcal{X} \times \{\pm 1\}$. The error of $h : \mathcal{X} \to \{\pm 1\}$ is
$\mathrm{Err}_h(S) = \frac{1}{m} |\{i : h(x_i) \ne y_i\}|$.
The error of $S$ w.r.t. a hypothesis class $\mathcal{H} \subset \{\pm 1\}^{\mathcal{X}}$ is $\mathrm{Err}_{\mathcal{H}}(S) = \min_{h \in \mathcal{H}} \mathrm{Err}_h(S)$. The basic idea
behind [33] is that if it is hard to distinguish a sample with small error from a sample that is very
random in a certain sense, then it is hard to (even approximately) learn $\mathcal{H}$.

For the sake of concreteness, we restrict to the problem of learning halfspaces over the boolean
cube. Namely, we take $\mathcal{X} = \{\pm 1\}^n$ and $\mathcal{H} = \mathrm{HALF} = \{h_w \mid w \in \mathbb{R}^n\}$. We say that a sample
is strongly scattered³ if the labels (i.e., the $y_i$'s) are independent fair coins (in particular, they are
independent from the $x_i$'s). For $m = m(n)$ and $\eta = \eta(n)$, we denote by $\mathrm{HALF}^{\mathrm{scat},\eta}_{m}$ the problem
of distinguishing a strongly-scattered sample from a sample with $\mathrm{Err}_{\mathrm{HALF}}(S) \le \eta$. Concretely, we
say that the problem is easy if there exists an efficient randomized algorithm $\mathcal{A}$ with the following
properties. Its input is a sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \subset \{\pm 1\}^n \times \{\pm 1\}$, and its output
satisfies:

• If $\mathrm{Err}_{\mathrm{HALF}}(S) \le \eta$, then

$\Pr_{\text{coins of } \mathcal{A}}(\mathcal{A}(S) = \text{“almost-realizable”}) \ge \frac{3}{4}$

• If $S$ is strongly scattered then, with probability $1 - o_n(1)$ over the choice of the labels,

$\Pr_{\text{coins of } \mathcal{A}}(\mathcal{A}(S) = \text{“scattered”}) \ge \frac{3}{4}$.

Theorem 2.1 [33] If for every $a > 0$ the problem $\mathrm{HALF}^{\mathrm{scat},\eta}_{n^a}$ is hard, then there is no efficient
learning algorithm with non-trivial performance on $\eta$-almost realizable distributions.

To be self-contained, and since Theorem 2.1 is not identical to the statement in [33], we outline a proof.

³A weaker notion, called scattering, was used in [33].

Proof Assume toward a contradiction that the efficient algorithm $\mathcal{L}$ is guaranteed to return a
hypothesis with error $\le \frac{1}{2} - \frac{1}{n^c}$ on $\eta$-almost realizable distributions. Let $M(n)$ be the maximal
number of random bits used by $\mathcal{L}$ when the examples lie in $\{\pm 1\}^n$. This includes both the bits
describing the examples produced by the oracle and “standard” random bits. Since $\mathcal{L}$ is efficient,
$M(n) < n^d$ for some $d > 0$. Define

$q(n) = n^{d + 2c}$.

We will derive a contradiction by showing how $\mathcal{L}$ can be used to solve $\mathrm{HALF}^{\mathrm{scat},\eta}_{q(n)}$. To this end,
consider the algorithm $\mathcal{A}$ defined below. On input $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \subset \{\pm 1\}^n \times \{\pm 1\}$
with $m = q(n)$,

1. Run $\mathcal{L}$ on $S$, such that the examples' oracle generates examples by choosing random examples
from $S$.

2. Let $h$ be the hypothesis that $\mathcal{L}$ returns. If $\mathrm{Err}_S(h) \le \frac{1}{2} - \frac{1}{n^c}$, output “almost-realizable”.
Otherwise, output “scattered”.

Next, we show that $\mathcal{A}$ solves $\mathrm{HALF}^{\mathrm{scat},\eta}_{q(n)}$. Indeed, if $\mathrm{Err}_{\mathrm{HALF}}(S) \le \eta$, then $\mathcal{L}$ is guaranteed
to return a hypothesis with $\mathrm{Err}_S(h) \le \frac{1}{2} - \frac{1}{n^c}$, and $\mathcal{A}$ will output “almost-realizable”. What if
$S$ is strongly scattered? Let $G \subset \{\pm 1\}^{\{\pm 1\}^n}$ be the collection of functions that $\mathcal{L}$ might return.
We note that $|G| \le 2^{n^d}$, since each hypothesis in $G$ can be described by $n^d$ bits. Namely, the
random bits that $\mathcal{L}$ uses and the description of the examples sampled by the oracle. Now, since
$S$ is strongly-scattered, by Hoeffding's bound, the probability that $\mathrm{Err}_S(h) \le \frac{1}{2} - \frac{1}{n^c}$ for a single
$h : \{\pm 1\}^n \to \{\pm 1\}$ is $\le \exp\left(-\frac{2 q(n)}{n^{2c}}\right)$. By the union bound, the probability that
$\mathrm{Err}_S(h) \le \frac{1}{2} - \frac{1}{n^c}$ for some $h \in G$ is at most

$|G| \exp\left(-\frac{2 q(n)}{n^{2c}}\right) \le 2^{n^d - 2 n^d} = 2^{-n^d}$.

It follows that the probability that $\mathcal{A}$ responds “almost-realizable” is $o(1)$.

□
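
The two-step algorithm used in the proof can be sketched in code. This is a minimal illustration, assuming the learner is given as a Python callable that receives an example oracle and returns a hypothesis; the names `distinguisher`, `learner`, `empirical_error` and the `seed` parameter are hypothetical and not part of the original construction.

```python
import random

def empirical_error(h, sample):
    """Fraction of examples (x, y) in the sample with h(x) != y."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def distinguisher(sample, learner, n, c, seed=0):
    """Turn a learner into a test for 'almost-realizable' versus 'scattered'.

    `learner` is assumed to take a zero-argument oracle returning one labelled
    example per call, and to return a hypothesis h mapping x to +1 or -1.
    """
    rng = random.Random(seed)
    # Step 1: the oracle answers by drawing random examples from the sample.
    oracle = lambda: rng.choice(sample)
    h = learner(oracle)
    # Step 2: compare the error on the sample with the threshold 1/2 - 1/n^c.
    threshold = 0.5 - n ** (-c)
    if empirical_error(h, sample) <= threshold:
        return "almost-realizable"
    return "scattered"
```

For a strongly scattered sample, the union bound in the proof is what guarantees that no hypothesis the learner can output passes the threshold, except with small probability.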


2.2 An overview 

For the sake of simplicity, we will first explain how to prove a weaker version of Theorem 1.3. At
the end of this section we will explain how to prove Theorem 1.3 in full.

Theorem 2.2 Suppose that for every K > A the problem ^°°(XORa:) is hard. Then, 

there is no efficient algorithm with non-trivial performance on ^^-almost-realizable distributions 
on {-1,1,0}’" X {±1}. 

The course of the proof is to reduce CSP™*^’^ (XOR^) to HALE* Given that reduction, 

„41og(K) 

and since K can be arbitrarily large, Theorem 2.2 follows from Theorem 2.1. 


The XOR problem as a learning problem

The basic conceptual idea is to interpret the $K$-XOR problem as a learning problem. Every
$\psi \in \{\pm 1\}^n$ naturally defines $h_\psi : \mathcal{X}_{n,K} \to \{\pm 1\}$, by mapping each $K$-tuple $C$ to the truth value
of the corresponding constraint, given the assignment $\psi$. Namely, $h_\psi(C) = \mathrm{XOR}(C(\psi))$. Now, we
can consider the hypothesis class $\mathcal{H}_K = \{h_\psi \mid \psi \in \{\pm 1\}^n\}$.

The $K$-XOR problem can now be formulated as follows. Given $J = \{C_1, \ldots, C_m\} \subset \mathcal{X}_{n,K}$,
find $h_\psi \in \mathcal{H}_K$ with minimal error on the sample $S = \{(C_1, 1), \ldots, (C_m, 1)\}$. Now, the problem
$\mathrm{CSP}^{\mathrm{rand},0.99}_{m}(\mathrm{XOR}_K)$ is the problem of distinguishing between the case that $\mathrm{Err}_{\mathcal{H}_K}(S) \le \frac{1}{100}$ and
the case where the different $C_i$'s were chosen independently and uniformly from $\mathcal{X}_{n,K}$.
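
The reformulation above can be checked on a toy instance. The sketch below assumes that a $K$-tuple is encoded as a tuple of (variable index, sign) pairs and that XOR over ±1 values is their product; both conventions, and the function names, are choices made for this example only. With them, the error of $h_\psi$ on the all-ones-labelled sample is one minus the fraction of constraints satisfied by $\psi$.

```python
from math import prod

def h(psi, C):
    """h_psi(C): the XOR (product, by assumption) of the literals of C under psi."""
    return prod(sign * psi[i] for i, sign in C)

def formula_to_sample(J):
    """Label every K-tuple of the formula with 1."""
    return [(C, 1) for C in J]

def error(psi, S):
    """Fraction of labelled tuples (C, y) with h_psi(C) != y."""
    return sum(h(psi, C) != y for C, y in S) / len(S)
```

Minimizing `error` over all assignments is therefore the same as maximizing the value of the formula.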

The mapping $J \mapsto S$ is still not a reduction from $\mathrm{CSP}^{\mathrm{rand},0.99}_{m}(\mathrm{XOR}_K)$ to the problem of
distinguishing a strongly scattered sample from a sample with small error w.r.t. halfspaces. This
is due to the following points:

• In the case that $J$ is random, $S$ is, in a sense, “very random”. Yet, it is not strongly-scattered.

• We need to measure the error w.r.t. halfspaces rather than $\mathcal{H}_K$.

Next, we explain how we address these two points.

Making the sample scattered

Addressing the first point is relatively easy. Given a sample $(C_1, 1), \ldots, (C_m, 1)$, we can produce
a new sample $(C'_1, y'_1), \ldots, (C'_m, y'_m)$ as follows: for every $i \in [m]$, w.p. $\frac{1}{2}$ we let $(C'_i, y'_i) = (C_i, 1)$,
and w.p. $\frac{1}{2}$ we let $y'_i = -1$ and let $C'_i$ be obtained from $C_i$ by flipping the sign of the
first literal. It is not hard to see that this reduction maps random instances to strongly scattered
instances. Also, it is not hard to see that the error of every $h_\psi \in \mathcal{H}_K$ does not change when moving
from the original sample to the new sample. Therefore, samples with error $\le \frac{1}{100}$ are mapped to
samples with error $\le \frac{1}{100}$.

Note that the reduction not only maps random instances to scattered instances, but in fact has
a stronger property that will be useful for addressing the second point (Replacing $\mathcal{H}_K$ by HALF).
Concretely, if the original instance is random, then the $(C'_i, y'_i)$'s are independent and uniform
elements from $\mathcal{X}_{n,K} \times \{\pm 1\}$.
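
The scattering step can be written out as a short sketch. It assumes the same hypothetical encoding as elsewhere in these examples (a $K$-tuple is a tuple of (variable index, sign) pairs and XOR over ±1 is a product). Flipping the first literal negates the XOR of the tuple, so a flipped tuple labelled $-1$ is classified correctly by $h_\psi$ exactly when the original tuple labelled 1 was, which is why every $h_\psi$ keeps its error.

```python
import random
from math import prod

def h(psi, C):
    """h_psi(C): product of the literals of C under the assignment psi."""
    return prod(sign * psi[i] for i, sign in C)

def error(psi, S):
    """Fraction of labelled tuples (C, y) with h_psi(C) != y."""
    return sum(h(psi, C) != y for C, y in S) / len(S)

def scatter(J, rng):
    """With probability 1/2 keep (C, 1); otherwise flip the first literal and label -1."""
    out = []
    for C in J:
        if rng.random() < 0.5:
            out.append((C, 1))
        else:
            (i, sign), rest = C[0], C[1:]
            out.append((((i, -sign),) + rest, -1))
    return out
```

The labels produced this way are independent fair coins, which is the strongly scattered property the next step relies on.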

Replacing $\mathcal{H}_K$ by HALF

To address the second point, we will show a reduction that maps a sample
$S = \{(C_1, y_1), \ldots, (C_m, y_m)\} \subset \mathcal{X}_{n,K} \times \{\pm 1\}$ to a sample
$(x_1, y_1), \ldots, (x_m, y_m) \in \{-1, 1, 0\}^{n'} \times \{\pm 1\}$. It will be convenient

to give the reduction the option to output “not-random” instead of producing a new sample. For 
large enough iL, if m = n7, the reduction has the following properties: 

• If the original sample is random (i.e., the examples $(C_i, y_i)$ were chosen independently and
uniformly from $\mathcal{X}_{n,K} \times \{\pm 1\}$), then w.h.p. the reduction will produce a new sample, and
given that a new sample was indeed produced, it will be strongly scattered.

• If the original sample has error $\le \frac{1}{100}$ w.r.t. $\mathcal{H}_K$ then the reduction will either say that
the original sample is not random, or it will produce a new sample with error $\le \frac{2}{100}$ w.r.t.
halfspaces.

Putting n' = and noting that m = ^ it is not hard to see that such a reduction, 

together with the previous reduction (the “scattering reduction”), indeed reduces (XORx) 

nT 

to HALF'"'^^’^. 

^41og{if) 

The reduction will work as follows. It will first test that $J := \{C_1, \ldots, C_m\}$ is pseudo-random,
in the sense that it satisfies certain (efficiently verifiable) properties (that will be specified later)
that are possessed by random formulas w.h.p. If the test fails, the reduction will say that $S$ is not
random. Otherwise, the reduction will produce the sample
$\Psi(S) := (\Psi(C_1), y_1), \ldots, (\Psi(C_m), y_m) \in \{-1, 1, 0\}^{n'} \times \{\pm 1\}$,
for a mapping $\Psi : \mathcal{X}_{n,K} \to \{-1, 1, 0\}^{n'}$ that we will specify later.

Next, we explain why this reduction has the desired properties. We start with the case that
$S$ is random. In that case, the pseudo-randomness test will pass w.h.p. and the reduction will
produce a new sample. Also, since the properties tested in this test do not depend on the labels,
given that a new sample was indeed produced, the new sample is strongly scattered.



We next deal with the case that $\mathrm{Err}_{\mathcal{H}_K}(S) \le \frac{1}{100}$. We will show that in this case either the
reduction will say that $S$ is not random, or the new sample will have error $\le \frac{2}{100}$ w.r.t. halfspaces.
To this end we will finally specify $\Psi$ and describe the list of pseudo-random properties.

It will be convenient to define $\Psi$ as a composition $\Psi = \rho \circ \pi$ where $\pi : \mathcal{X}_{n,K} \to \{-1, 1, 0\}^{nK}$
and $\rho : \{-1, 1, 0\}^{nK} \to \{-1, 1, 0\}^{n'}$. We first define $\pi$. The indicator vector of a literal is
the vector in $\{0, -1, 1\}^n$ all of whose coordinates are zero except the coordinate corresponding to the
literal, which is $1$ ($-1$) if the literal is un-negated (negated). We define $\pi(C)$ as a concatenation
of $K$ vectors, where the $i$'th vector is the indicator vector of the $i$'th literal in $C$. As for $\rho$, we
let $\rho(x) \in \{-1, 1, 0\}^{n'}$ be a vector consisting of all products of the form $x_{i_1} \cdot \ldots \cdot x_{i_r}$ for
$r \le \sqrt{K} \log(K)$, and padded with zeros in the remaining coordinates. We note that for large
enough $n$, the number of such products is $\le (nK + 1)^{\sqrt{K} \log(K)}$.
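
The two maps just described can be sketched as follows. This is an illustration under stated assumptions: a $K$-tuple is a tuple of (variable index, sign) pairs, the products in the second map are taken over index multisets of size at most $d$ (including the empty product, which gives a constant coordinate), and the zero padding up to dimension $n'$ is omitted. The function names `pi` and `rho` simply mirror the symbols in the text.

```python
from itertools import combinations_with_replacement
from math import prod

def pi(C, n):
    """Concatenate the indicator vectors (length n each) of the K literals of C."""
    vec = []
    for i, sign in C:
        block = [0] * n
        block[i] = sign  # +1 for an un-negated literal, -1 for a negated one
        vec.extend(block)
    return vec

def rho(x, d):
    """All products x[i_1] * ... * x[i_r] with r <= d (repetitions allowed)."""
    feats = []
    for r in range(d + 1):
        for idx in combinations_with_replacement(range(len(x)), r):
            feats.append(prod(x[i] for i in idx))
    return feats
```

A halfspace over the output of `rho` is a threshold of a polynomial of degree at most $d$ in the coordinates of `pi(C, n)`, which is the point of composing the two maps.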

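Now, it is not hard to see that the error of $\Psi(S)$ w.r.t. halfspaces is exactly the error of $\pi(S)$
w.r.t. $\mathrm{POL}_d$ - the hypothesis class consisting of threshold functions of polynomials of degree at most
$d = \sqrt{K} \log(K)$. Therefore, we will want to show that if $\mathrm{Err}_{\mathcal{H}_K}(S) \le \frac{1}{100}$ then
$\mathrm{Err}_{\mathrm{POL}_d}(\pi(S)) \le \frac{2}{100}$.

Suppose that $\mathrm{Err}_{\mathcal{H}_K}(S) \le \frac{1}{100}$, and let $\psi \in \{\pm 1\}^n$ be such that $\mathrm{Err}_{h_\psi}(S) \le \frac{1}{100}$. To show that
$\mathrm{Err}_{\mathrm{POL}_d}(\pi(S)) \le \frac{2}{100}$ it is enough to construct a degree $\le d$ polynomial $p : \{-1, 1, 0\}^{nK} \to \mathbb{R}$ such
that $\Pr_{j \sim [m]}(h_\psi(C_j) \ne \mathrm{sign}(p(\pi(C_j)))) \le \frac{1}{100}$. Indeed, in that case,

$\mathrm{Err}_{\mathrm{POL}_d}(\pi(S)) \le \Pr_{j \sim [m]}(y_j \ne \mathrm{sign}(p(\pi(C_j))))$
$\le \Pr_{j \sim [m]}(y_j \ne h_\psi(C_j)) + \Pr_{j \sim [m]}(h_\psi(C_j) \ne \mathrm{sign}(p(\pi(C_j))))$
$\le \frac{1}{100} + \frac{1}{100} = \frac{2}{100}$.

We note that we will actually find $p$ with a stronger property - namely, that
$\Pr_{j \sim [m]}(h_\psi(C_j) \ne p(\pi(C_j))) \le \frac{1}{100}$.
This stronger property is not needed for proving the simplified version (Theorem 2.2), but will
be needed for proving Theorem 1.3.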

Unfortunately, it is not the case that we can always find such a polynomial. To overcome it, as
explained above, the reduction will first check a set of pseudo-random properties. Concretely, for
all small enough sets of literals, the reduction will check that the number of $K$-tuples that contain
all the literals in the set is close to what is expected for a random sample. For example, it will
check that the fraction of $K$-tuples containing the literal $x_1$ is approximately $\frac{K}{2n}$.

It therefore remains to show that for pseudo-random $S$ there is a degree $\le d$ polynomial
$p : \{-1, 1, 0\}^{nK} \to \mathbb{R}$ such that $\Pr_{j \sim [m]}(h_\psi(C_j) \ne p(\pi(C_j))) \le \frac{1}{100}$. To this end, consider the linear
map $T_\psi : \mathbb{R}^{nK} \to \mathbb{R}^K$ that is defined as

$T_\psi(v_1 | v_2 | \ldots | v_K) = (\langle v_1, \psi \rangle, \langle v_2, \psi \rangle, \ldots, \langle v_K, \psi \rangle)$

We note that $\forall C \in \mathcal{X}_{n,K}$, $T_\psi(\pi(C)) = C(\psi)$. We will consider polynomials of the form $p = p' \circ T_\psi$
where $p' : \{\pm 1\}^K \to \mathbb{R}$ is a degree $\le d$ polynomial. We note that for such $p$ we have

$\Pr_{j \sim [m]}(h_\psi(C_j) \ne p(\pi(C_j))) = \Pr_{j \sim [m]}(\mathrm{XOR}(C_j(\psi)) \ne p'(C_j(\psi)))$

It is therefore enough to find a degree $\le d$ polynomial $p'$ for which
$\Pr_{z \sim \mathcal{D}(\psi)}(\mathrm{XOR}(z) = p'(z)) \ge 0.99$.
Here, $\mathcal{D}(\psi)$ is the distribution of the random variable $C_j(\psi)$ where $j \sim [m]$. Now, even though
it is not possible to do that for general distributions, the fact that the sample is pseudo-random
implies that $\mathcal{D}(\psi)$ is close, in a certain sense, to the uniform distribution on $\{\pm 1\}^K$. Now, for
uniform $z \in \{\pm 1\}^K$ we have that $\left|\sum_{i=1}^K z_i\right| \le \sqrt{K} \log(K)$ w.p. $1 - o_K(1)$, and the same holds
for $z \sim \mathcal{D}(\psi)$. Therefore, since $\mathrm{XOR}(z)$ is fully determined given $\sum_{i=1}^K z_i$, and since $\sum_{i=1}^K z_i$ takes
at most $d + 1$ values in the interval $[-d, d]$, we can take $p'(z) = p''\left(\sum_{i=1}^K z_i\right)$ where $p'' : \mathbb{R} \to \mathbb{R}$ is a degree
$\le \sqrt{K} \log(K)$ polynomial that satisfies $p''\left(\sum_{i=1}^K z_i\right) = \mathrm{XOR}(z)$ whenever $\left|\sum_{i=1}^K z_i\right| \le \sqrt{K} \log(K)$.
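
The existence of the univariate polynomial $p''$ can be illustrated by explicit Lagrange interpolation. The sketch below assumes that XOR over ±1 values is their product, so that for $z \in \{\pm 1\}^K$ with sum $s$ the XOR equals $(-1)^{(K-s)/2}$; it interpolates this function on the attainable sums of absolute value at most $d$, using exact rational arithmetic. The function names are hypothetical, and the concrete $K$ and $d$ in the usage are small values chosen for illustration rather than $d = \sqrt{K}\log(K)$.

```python
from fractions import Fraction

def xor_from_sum(s, K):
    """XOR (product) of K values in {+1, -1} whose sum is s: (-1)^(number of -1's)."""
    return (-1) ** ((K - s) // 2)

def make_p2(K, d):
    """Return (p2, degree bound), where p2 interpolates XOR on the sums s with |s| <= d."""
    nodes = [s for s in range(-K, K + 1, 2) if abs(s) <= d]

    def p2(t):
        total = Fraction(0)
        for a in nodes:
            term = Fraction(xor_from_sum(a, K))
            for b in nodes:
                if b != a:
                    term *= Fraction(t - b, a - b)  # Lagrange basis factor
            total += term
        return total

    return p2, len(nodes) - 1
```

Composing `p2` with the sum of the coordinates gives a polynomial of degree at most $d$ in $z$ that agrees with XOR on every point whose coordinate sum lies in $[-d, d]$.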




Proving Theorem 1.3 in full 

Theorem 2.2 differs from Theorem 1.3 in two aspects. The first is that in Theorem 1.3 the completeness parameter in the assumption can be arbitrarily close to $\frac{1}{2}$ (rather than 0.99 in Theorem 2.2) and at the same time, the conclusion holds for $\eta$-almost realizable distributions for arbitrarily small $\eta$ (rather than $\eta = 0.02$ in Theorem 2.2). The second aspect, which is much more minor, is that in Theorem 2.2 the hard distribution can be chosen to be supported in $\{-1,1,0\}^n \times \{\pm 1\}$, while Theorem 1.3 is slightly more restrictive and requires that the distribution be supported in $\{\pm 1\}^n \times \{\pm 1\}$.

To address the first aspect, we will first reduce the random XOR problem to the random majority-of-$q$-XORs problem, and then we will follow a similar (but slightly more involved) argument as the one described above. The reduction will work as follows. Given a XOR-formula with $m$ XOR-constraints, the reduction will produce a MXOR-formula with $\frac{m}{q}$ constraints, each of which is a majority of $q$ random XOR-constraints from the original formula. This reduction will amplify the completeness parameter toward 1.

To address the second aspect, we note that halfspaces on $\{-1,1,0\}^n$ can be realized by halfspaces on $\{\pm 1\}^{2n}$ using the map $T : \{-1,1,0\}^n \to \{\pm 1\}^{2n}$ that is defined as follows:

$$T(x) = (T(x_1),\ldots,T(x_n)),$$

where for $x \in \{0,-1,1\}$,

$$T(x) = \begin{cases} (1,1) & x = 1 \\ (-1,-1) & x = -1 \\ (-1,1) & x = 0 \end{cases}$$

It is not hard to see that for every $w \in \mathbb{R}^n$ we have $h_w = h_{w'} \circ T$ where $w' = \frac{1}{2}(w_1,w_1,\ldots,w_n,w_n)$. This observation shows that if there is no efficient algorithm with non-trivial performance on $\eta(n)$-almost-realizable distributions on $\{-1,1,0\}^n \times \{\pm 1\}$, then there is no efficient algorithm with non-trivial performance on $\eta$-almost-realizable distributions on $\{\pm 1\}^n \times \{\pm 1\}$.
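
The identity $h_w = h_{w'} \circ T$ can be checked exhaustively for small $n$. The following is a minimal sketch, not part of the original argument; the function names (`embed`, `lift_weights`, `check_all`) are hypothetical and chosen only for illustration.

```python
from itertools import product

# Coordinate-wise encoding from the text: 1 -> (1, 1), -1 -> (-1, -1), 0 -> (-1, 1).
PAIR = {1: (1, 1), -1: (-1, -1), 0: (-1, 1)}


def embed(x):
    """Map x in {-1, 0, 1}^n to T(x) in {-1, 1}^(2n)."""
    out = []
    for xi in x:
        out.extend(PAIR[xi])
    return tuple(out)


def lift_weights(w):
    """Return w' = (1/2) * (w_1, w_1, ..., w_n, w_n)."""
    out = []
    for wi in w:
        out.extend((wi / 2, wi / 2))
    return tuple(out)


def inner(u, v):
    """Standard inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def check_all(w):
    """Exhaustively verify <w', T(x)> == <w, x> over all x in {-1, 0, 1}^n."""
    w_lifted = lift_weights(w)
    return all(
        inner(w_lifted, embed(x)) == inner(w, x)
        for x in product((-1, 0, 1), repeat=len(w))
    )
```

Since the two inner products agree on every input, their signs agree as well, which is the claim used in the text.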


A road map 

In section A.1 we analyze elementary properties of random formulas, and define accordingly a notion of a pseudo-random formula. We also show that if $J = \{C_1,\ldots,C_m\} \subset \mathcal{X}_{n,K}$ is a pseudo-random formula, then for $j \sim [m]$ and every assignment $\psi \in \{\pm 1\}^n$, the distribution of the random variable $C_j(\psi) \in \{\pm 1\}^K$ is close to the uniform distribution in a certain sense. In section A.2 we show that for pseudo-random formulas, for every assignment $\psi \in \{\pm 1\}^n$, the mapping $C_j \mapsto C_j(\psi)$ can be approximated by a low-degree polynomial. Finally, the full reduction is outlined in section A.3. In section A.4 we briefly explain how our argument can be extended to prove Theorems 1.4 and 1.5.
A Proof of Theorem 1.3


A.1 Pseudo-random formulas

A partial $K$-tuple supported in $A \subset [K]$ is a mapping $C : \{\pm 1\}^n \to \{-1,1,*\}^K$ such that the output coordinates corresponding to $A$ are literals corresponding to $|A|$ different variables, and the remaining coordinates are the constant function $*$. The size of $C$ is $|A|$. We denote by $\mathcal{X}_{n,K,A}$ the collection of partial $K$-tuples that are supported in $A$. We note that

$$|\mathcal{X}_{n,K,A}| = (2n)(2n-2)\cdot\ldots\cdot(2n+2-2|A|) \qquad (1)$$

For $A \subset [K]$ we denote by $\Pi_A : \{\pm 1\}^K \to \{-1,1,*\}^K$ the function that maps all coordinates in $[K]\setminus A$ to $*$ and leaves the remaining coordinates unchanged. For $C \in \mathcal{X}_{n,K}$ we denote by $C_A : \{\pm 1\}^n \to \{-1,1,*\}^K$ the partial $K$-tuple $\Pi_A \circ C$. For a $K$-formula $J$ with $m$ constraints, and a partial $K$-tuple $C$ supported in $A$, we define the frequency of $C$ in $J$ as $\mathrm{Fr}_J(C) = \frac{|\{C' \in J \,:\, C'_A = C\}|}{m}$. We note that for random $J$, if $C$ is a partial $K$-tuple that is supported in a set $A$ of size $t$, then $\mathrm{Fr}_J(C)$ is an average of $m$ independent Bernoulli variables with parameter $p_{n,t} := \frac{1}{|\mathcal{X}_{n,K,A}|} = \frac{1}{(2n)(2n-2)\cdots(2(n-t+1))}$. By Hoeffding's bound we have

Lemma A.1 Let $J = \{C_1,\ldots,C_m\}$ be a random formula. Then, for every partial $K$-tuple $C$ of size $t$ we have $\Pr_J\left(|\mathrm{Fr}_J(C) - p_{n,t}| \ge \tau\right) \le 2\exp\left(-2m\tau^2\right)$
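
As an illustration of the quantities in Lemma A.1, the following sketch samples a random formula and compares an empirical frequency with $p_{n,t}$. It is only a sanity check under stated assumptions: the representation of a literal as a `(variable, sign)` pair and of a partial tuple as a dict from positions to literals, as well as all function names, are hypothetical choices made for this example.

```python
import math
import random


def p_nt(n, t):
    """p_{n,t} = 1 / ((2n)(2n-2)...(2(n-t+1)))."""
    denom = 1
    for i in range(t):
        denom *= 2 * (n - i)
    return 1.0 / denom


def random_tuple(n, K, rng):
    """A uniform K-tuple: K distinct variables in random order, each with a random sign."""
    variables = rng.sample(range(n), K)
    return tuple((v, rng.choice((-1, 1))) for v in variables)


def frequency(formula, partial):
    """Fraction of tuples agreeing with `partial`, a dict mapping position -> literal."""
    hits = sum(all(C[i] == lit for i, lit in partial.items()) for C in formula)
    return hits / len(formula)


def hoeffding_bound(m, tau):
    """Right-hand side of Lemma A.1: 2 * exp(-2 * m * tau^2)."""
    return 2 * math.exp(-2 * m * tau ** 2)
```

For instance, with $n = 5$ and a partial tuple of size 1, the expected frequency is $p_{5,1} = 1/10$, and a formula with a few thousand random tuples should land close to that value.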

We say that $J$ is $(t,\tau)$-pseudo-random if $|\mathrm{Fr}_J(C) - p_{n,t'}| < \tau$ for every partial $K$-tuple $C$ of size $t' \le t$. We say that $J$ is $\tau$-pseudo-random if it is $(K,\tau)$-pseudo-random. By Lemma A.1, the fact that the number of partial $K$-tuples is

$$1 + \sum_{j=1}^{K} \binom{K}{j}(2n)(2n-2)\cdot\ldots\cdot(2n+2-2j) \le 2^K(2n)^K \le (2n)^{2K},$$

and the union bound we have

Lemma A.2 Let $J = \{C_1,\ldots,C_m\}$ be a random formula. Then the probability that $J$ is not $\tau$-pseudo-random is at most $(2n)^{2K}\, 2\exp\left(-2m\tau^2\right)$

For a formula $J = \{C_1,\ldots,C_m\} \subset \mathcal{X}_{n,K}$ and $\psi \in \{\pm 1\}^n$, we denote by $\mathcal{D}(J,\psi)$ the distribution of the random variable $C_j(\psi) \in \{\pm 1\}^K$ where $j \sim \mathrm{Uni}([m])$. A vector $z \in \{-1,1,*\}^K$ is supported in $A \subset [K]$ if $A = \{i \in [K] \mid z_i \ne *\}$. We say that a distribution $\mathcal{D}$ on $\{\pm 1\}^K$ is $(t,\mu)$-close to the uniform distribution if for every $z \in \{-1,1,*\}^K$ that is supported in a set $A \in \binom{[K]}{\le t}$ we have that

$$\left|\Pr_{z'\sim\mathcal{D}}\left(\Pi_A(z') = z\right) - 2^{-|A|}\right| \le \mu$$


Lemma A.3 If $J$ is $(t,\tau)$-pseudo-random then for every $\psi \in \{\pm 1\}^n$, $\mathcal{D}(J,\psi)$ is $(t, n^t\tau)$-close to the uniform distribution.


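Proof Fix a vector $z \in \{-1,1,*\}^K$ supported in $A \in \binom{[K]}{\le t}$. We have

$$\Pr_{z'\sim\mathcal{D}(J,\psi)}\left(\Pi_A(z') = z\right) = \Pr_{j\sim[m]}\left((C_j)_A(\psi) = z\right) = \frac{1}{m}\left|\{j \in [m] \mid (C_j)_A(\psi) = z\}\right|$$

$$= \frac{1}{m}\sum_{C \in \mathcal{X}_{n,K,A} \mid C(\psi) = z} \left|\{j \in [m] \mid (C_j)_A = C\}\right| = \sum_{C \in \mathcal{X}_{n,K,A} \mid C(\psi) = z} \mathrm{Fr}_J(C)$$

Denote $U = \{C \in \mathcal{X}_{n,K,A} \mid C(\psi) = z\}$ and note that $|U| = n(n-1)\cdot\ldots\cdot(n-|A|+1) = \frac{2^{-|A|}}{p_{n,|A|}}$. By the $(t,\tau)$-pseudo-randomness of $J$ we have

$$\left|\Pr_{j\sim[m]}\left((C_j)_A(\psi) = z\right) - 2^{-|A|}\right| = \left|\sum_{C \in U} \mathrm{Fr}_J(C) - 2^{-|A|}\right| \le \left|\sum_{C \in U} p_{n,|A|} - 2^{-|A|}\right| + |U|\tau$$

$$= \left||U|\,p_{n,|A|} - 2^{-|A|}\right| + |U|\tau = |U|\tau \le n^{|A|}\tau \le n^{t}\tau$$

□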
A.2 Approximately realizing assignments by polynomials


In this section we will show that when a formula $J = \{C_1,\ldots,C_m\}$ is pseudo-random then for every assignment $\psi \in \{\pm 1\}^n$ the mapping $C \mapsto \mathrm{XOR}(C(\psi))$ can be approximately realized by a low-degree polynomial on $J$. Namely, there is a low degree polynomial $p : \mathcal{X}_{n,K} \to \{\pm 1\}$ such that on most $K$-tuples $C \in J$, we have $p(C) = \mathrm{XOR}(C(\psi))$. To this end, we must represent $K$-tuples as vectors. The way we will do this is the following. Recall that the indicator vector of a literal is the vector in $\{0,-1,1\}^n$ all of whose coordinates are zero except the coordinate corresponding to the literal, which is $1$ ($-1$) if the literal is un-negated (negated). Also, we defined $\pi : \mathcal{X}_{n,K} \to \{0,-1,1\}^{nK}$ such that $\pi(C)$ is a concatenation of $K$ vectors, where the $i$'th vector is the indicator vector of the $i$'th literal in $C$.
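
The vector representation can be made concrete as follows. This is an illustrative sketch only: literals are represented as hypothetical `(variable, sign)` pairs, and `T` denotes the block-wise inner product map $(v_1|\ldots|v_K) \mapsto (\langle v_1,\psi\rangle,\ldots,\langle v_K,\psi\rangle)$, which recovers $C(\psi)$ from $\pi(C)$.

```python
def literal_indicator(var, sign, n):
    """Indicator vector in {0, -1, 1}^n of the literal (var, sign)."""
    v = [0] * n
    v[var] = sign
    return v


def pi(C, n):
    """Concatenate the indicator vectors of the K literals of C."""
    out = []
    for var, sign in C:
        out.extend(literal_indicator(var, sign, n))
    return out


def evaluate(C, psi):
    """C(psi): the values the K literals take under the assignment psi."""
    return [sign * psi[var] for var, sign in C]


def T(v, psi):
    """Block-wise inner products (<v_1, psi>, ..., <v_K, psi>)."""
    n = len(psi)
    return [
        sum(a * b for a, b in zip(v[i:i + n], psi))
        for i in range(0, len(v), n)
    ]
```

Each block of $\pi(C)$ has a single nonzero entry, so its inner product with $\psi$ is exactly the value of the corresponding literal.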

We will use the following version of Chernoff's bound from [52].

Lemma A.4 Let V be a distribution on {±1}'^, 1 > > 0 and ^ < a < ■ Assume that for 

every z G { — 1,1, *} that is supported in a set A of size r = \ldK~\ we have = z) < . 

Then 


Pr 

Z'^T) 


K 





< 2 exp 




) 


Lemma A.5 LetT be a distribution on {±1}^ that is {t, fi)-close to the uniform distribution, and 
let d such that ^ < d <t. Then, there exists a degree < d polynomial p : {±1}^ —)• M such that 


Pr (X0R(2;) ^ p{z)) < 2 exp 

z^T) 



V2 2R:’2 d ) 



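Proof Let $\Lambda : \{\pm 1\}^K \to \mathbb{R}$ be the linear mapping $\Lambda(z) = \sum_{i=1}^K z_i$. Note that $\mathrm{XOR}(z)$ is fully determined given $\Lambda(z)$. Namely, there is $f : \mathbb{R} \to \{\pm 1\}$ such that $\mathrm{XOR} = f \circ \Lambda$. We note that the image of $\Lambda$ has at most $d+1$ values in the interval $[-d,d]$. Therefore, there is a degree $\le d$ polynomial $q : \mathbb{R} \to \mathbb{R}$ that coincides with $f$ on $\Lambda\left(\{\pm 1\}^K\right) \cap [-d,d]$. Consider now the degree $\le d$ polynomial $p = q \circ \Lambda$. We have that $p$ coincides with $\mathrm{XOR}$ whenever $\left|\sum_{i=1}^K z_i\right| \le d$. Hence, we have

$$\Pr_{z\sim\mathcal{D}}\left(\mathrm{XOR}(z) \ne p(z)\right) \le \Pr_{z\sim\mathcal{D}}\left(\left|\sum_{i=1}^K z_i\right| > d\right)$$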

The lemma will now follow from Lemma A.4 when /3 = ^ and a = ^ ^ d^ ' remains to show 

that the conditions of the Lemmas hold, namely, that a < and that for every z G {—1,1,*} 
that is supported in a set A of size r = [/3A] = d we have {YiA{z’) = z) < a^. The first 

conditions follows from the requirement that ^ < d. For the second condition, since V is 
(t, /i)-close to the uniform distribution and d <t we have 



a 


d 


> 


> 





2'^-V 

d 


□ 
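
The interpolation step in the proof above can be sketched in code. This is a minimal illustration under stated assumptions, not the paper's construction verbatim: it builds $q$ by exact Lagrange interpolation over the attainable values of $\sum_i z_i$ in $[-d,d]$, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def xor(z):
    """XOR on {-1, 1}^K, written multiplicatively."""
    out = 1
    for zi in z:
        out *= zi
    return out


def xor_from_sum(s, K):
    """XOR(z) as a function of s = sum(z): exactly (K - s) / 2 entries equal -1."""
    return -1 if ((K - s) // 2) % 2 else 1


def interpolation_nodes(K, d):
    """Attainable values of sum(z) that lie in [-d, d]; at most d + 1 of them."""
    return [s for s in range(-K, K + 1, 2) if abs(s) <= d]


def q_poly(t, K, d):
    """Lagrange interpolant of xor_from_sum on the nodes, evaluated at t."""
    nodes = interpolation_nodes(K, d)
    total = Fraction(0)
    for a in nodes:
        term = Fraction(xor_from_sum(a, K))
        for b in nodes:
            if b != a:
                term *= Fraction(t - b, a - b)
        total += term
    return total


def p_poly(z, d):
    """p = q o Lambda, a polynomial of degree at most d in z."""
    return q_poly(sum(z), len(z), d)
```

Because there are at most $d+1$ nodes, the interpolant has degree at most $d$, and it agrees with XOR on every $z$ with $|\sum_i z_i| \le d$.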


Lemma A.6 Let J = {Ci,... ,Cm} £ ^n,K be {t,T)-pseudo-random, '0 £ {il}” o.nd d such that 
^ < d < t. Then, there exists a degree < d polynomial p : {0, —1, —)■ M such that 

iXOR{C,m + p{7r{C,))) < 2exp (^2 + 2A’ 2 + ^ 

Proof Since $J$ is $(t,\tau)$-pseudo-random and $d \le t$, it is also $(d,\tau)$-pseudo-random. By Lemma A.3, $\mathcal{D}(J,\psi)$ is $(d, n^d\tau)$-close to the uniform distribution. Therefore, by Lemma A.5 there is a degree $\le d$ polynomial $p'$ for which

^^ ^ ■ 

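Now, let $T : \mathbb{R}^{nK} \to \mathbb{R}^{K}$ be the linear map

$$T(v_1|v_2|\ldots|v_K) = \left(\langle v_1,\psi\rangle, \langle v_2,\psi\rangle, \ldots, \langle v_K,\psi\rangle\right)$$

Note that for all $C \in \mathcal{X}_{n,K}$, $T(\pi(C)) = C(\psi)$. Consider the degree $\le d$ polynomial $p = p' \circ T$. We have

$$\Pr_{j\sim[m]}\left(\mathrm{XOR}(C_j(\psi)) \ne p(\pi(C_j))\right) = \Pr_{j\sim[m]}\left(\mathrm{XOR}(C_j(\psi)) \ne p'(T(\pi(C_j)))\right)$$

$$= \Pr_{j\sim[m]}\left(\mathrm{XOR}(C_j(\psi)) \ne p'(C_j(\psi))\right) = \Pr_{z\sim\mathcal{D}(J,\psi)}\left(\mathrm{XOR}(z) \ne p'(z)\right) \le$$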

1 d 1 


2expl-D|- + —.- + 


nr 



□ 


A.3 The reduction 

Step I: Amplifying the gap — from XOR to majority of XORs 
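For odd $q$, we define the predicate $\mathrm{MXOR}_{q,K} : \{\pm 1\}^{qK} \to \{\pm 1\}$ by

$$\mathrm{MXOR}_{q,K}(z) = \mathrm{MAJ}\left(\mathrm{XOR}(z_1,\ldots,z_K),\ldots,\mathrm{XOR}(z_{(q-1)K+1},\ldots,z_{qK})\right)$$

A $(q,K)$-tuple is an element in $\mathcal{X}_{n,q,K} := (\mathcal{X}_{n,K})^q$. For a $(q,K)$-tuple $C = (C^1,\ldots,C^q)$ and an assignment $\psi \in \{\pm 1\}^n$ we denote $C(\psi) = \left(C^1(\psi),\ldots,C^q(\psi)\right) \in \{\pm 1\}^{qK}$. A $(q,K)$-formula is a collection $J = \{C_1,\ldots,C_m\}$ of $(q,K)$-tuples. An instance to the $(q,K)$-MXOR problem is a $(q,K)$-formula, and the goal is to find an assignment $\psi \in \{\pm 1\}^n$ that maximizes $\mathrm{VAL}_{\psi,\mathrm{MXOR}}(J) := \frac{|\{j \,:\, \mathrm{MXOR}_{q,K}(C_j(\psi)) = 1\}|}{m}$. We define the value of $J$ as $\mathrm{VAL}_{\mathrm{MXOR}}(J) := \max_{\psi \in \{\pm 1\}^n} \mathrm{VAL}_{\psi,\mathrm{MXOR}}(J)$.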
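
The predicate and the value of an assignment can be written out directly. The sketch below is illustrative only; it assumes a hypothetical representation in which a $(q,K)$-tuple is a list of $q$ blocks, each a list of `(variable, sign)` literals.

```python
def xor(bits):
    """XOR on {-1, 1} values, written multiplicatively."""
    out = 1
    for b in bits:
        out *= b
    return out


def mxor(z, q, K):
    """Majority of q XORs, each over a block of K entries of z in {-1, 1}^(qK)."""
    assert q % 2 == 1 and len(z) == q * K
    votes = sum(xor(z[i * K:(i + 1) * K]) for i in range(q))
    return 1 if votes > 0 else -1


def evaluate(C, psi):
    """C(psi) for a (q, K)-tuple C given as q lists of (variable, sign) literals."""
    return [sign * psi[var] for block in C for var, sign in block]


def val(formula, psi, q, K):
    """Fraction of (q, K)-tuples whose MXOR evaluates to 1 under psi."""
    return sum(mxor(evaluate(C, psi), q, K) == 1 for C in formula) / len(formula)
```

Since $q$ is odd, the vote count is never zero, so the majority is always well defined.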

For $m = m(n)$, $(q,K) = (q(n),K(n))$ and $\frac{1}{2} > \eta = \eta(n) > 0$, we say that the problem $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{m}(\mathrm{MXOR}_{q,K})$ is easy, if there exists an efficient randomized algorithm $\mathcal{A}$ with the following properties. Its input is a $(q,K)$-formula $J$ with $n$ variables and $m$ constraints and its output satisfies:

• If $\mathrm{VAL}_{\mathrm{MXOR}}(J) \ge 1-\eta$, then

$$\Pr_{\text{coins of } \mathcal{A}}\left(\mathcal{A}(J) = \text{“non-random”}\right) \ge \frac{3}{4}$$

• If $J$ is random$^4$ then, with probability $1 - o_n(1)$ over the choice of $J$,

$$\Pr_{\text{coins of } \mathcal{A}}\left(\mathcal{A}(J) = \text{“random”}\right) \ge \frac{3}{4}.$$

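Lemma A.7 The problem $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{m}(\mathrm{XOR}_K)$ can be efficiently reduced to $\mathrm{CSP}^{\mathrm{rand},\,1-4\exp\left(-2\left(\eta-\frac{1}{2}\right)^2 q\right)}_{\lfloor m/q \rfloor}(\mathrm{MXOR}_{q,K})$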


We will use the following version of Chernoff’s bound from [52]. 

Lemma A.8 Let $\mathcal{D}$ be a distribution on $\{0,1\}^q$ and $\eta < \frac{1}{2}$. Assume that for every $A \subset [q]$ we have $\Pr_{z\sim\mathcal{D}}\left(\forall i \in A,\ z_i = 1\right) \le \eta^{|A|}$. Then



< exp 




< exp 




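Proof (of Lemma A.7) Given a $K$-formula $J = \{C_1,\ldots,C_m\} \subset \mathcal{X}_{n,K}$ we will produce a $(q,K)$-formula $J'$ as follows. We first randomly throw away $m - q\lfloor m/q \rfloor$ of $J$'s $K$-tuples. Then, we randomly partition the remaining tuples into $\lfloor m/q \rfloor$ equally sized ordered bundles. For each such bundle $\{C_{j_1},\ldots,C_{j_q}\}$, we add to $J'$ the $(q,K)$-tuple $C = (C_{j_1},\ldots,C_{j_q})$. To see that the reduction works note that:

• If $J$ is random then so is $J'$.

• Assume now that $\mathrm{VAL}_{\mathrm{XOR}}(J) \ge 1-\eta$, and let $\psi \in \{\pm 1\}^n$ be an assignment with $\mathrm{VAL}_{\psi,\mathrm{XOR}}(J) \ge 1-\eta$. Consider a single random bundle $\{C_{j_1},\ldots,C_{j_q}\}$. By Lemma A.8 we have that the probability that for most $K$-tuples in the bundle we have $\mathrm{XOR}_K(C(\psi)) = -1$ is $\le \exp\left(-2\left(\eta - \frac{1}{2}\right)^2 q\right)$. Hence, $\mathbb{E}\left[1 - \mathrm{VAL}_{\psi,\mathrm{MXOR}}(J')\right] \le \exp\left(-2\left(\eta - \frac{1}{2}\right)^2 q\right)$. By Markov's inequality we have that $\mathrm{VAL}_{\psi,\mathrm{MXOR}}(J') \ge 1 - 4\exp\left(-2\left(\eta - \frac{1}{2}\right)^2 q\right)$ w.p. $\ge \frac{3}{4}$.

□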
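
The bundling step of the reduction can be sketched as follows. This is a minimal illustration with hypothetical names; shuffling and then truncating is one way, assumed here, to realize "randomly discard, then randomly partition into ordered bundles".

```python
import math
import random


def bundle(formula, q, rng):
    """Discard m - q*floor(m/q) random tuples, then split the rest into ordered bundles of size q."""
    shuffled = list(formula)
    rng.shuffle(shuffled)
    keep = q * (len(shuffled) // q)
    shuffled = shuffled[:keep]
    return [tuple(shuffled[i:i + q]) for i in range(0, keep, q)]


def amplified_completeness(eta, q):
    """The completeness 1 - 4*exp(-2*(eta - 1/2)^2 * q) obtained after Markov's inequality."""
    return 1 - 4 * math.exp(-2 * (eta - 0.5) ** 2 * q)
```

For example, `bundle(list(range(17)), 5, random.Random(1))` keeps 15 of the 17 tuples and returns 3 bundles of 5 tuples each.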


Step II: Making the sample scattered — from (MXOR) to (MXOR, -iMXOR) 

A labeled $(q,K)$-formula is a collection $J = \{(C_1,y_1),\ldots,(C_m,y_m)\} \subset \mathcal{X}_{n,q,K} \times \{\pm 1\}$. An instance to the $(q,K)$-$(\mathrm{MXOR},\neg\mathrm{MXOR})$ problem is a labeled $(q,K)$-formula, and the goal is to find an assignment $\psi \in \{\pm 1\}^n$ that maximizes $\mathrm{VAL}_{\psi,\mathrm{MXOR}}(J) := \frac{|\{j \,:\, \mathrm{MXOR}_{q,K}(C_j(\psi)) = y_j\}|}{m}$. We define the value of $J$ as $\mathrm{VAL}_{\mathrm{MXOR}}(J) := \max_{\psi \in \{\pm 1\}^n} \mathrm{VAL}_{\psi,\mathrm{MXOR}}(J)$. For $m = m(n)$, $(q,K) = (q(n),K(n))$ and $\frac{1}{2} > \eta = \eta(n) > 0$, we say that the problem $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{m}(\mathrm{MXOR}_{q,K},\neg\mathrm{MXOR}_{q,K})$ is easy, if there exists an efficient randomized algorithm $\mathcal{A}$ with the following properties. Its input is a labeled $(q,K)$-formula $J$ with $n$ variables and $m$ constraints and its output satisfies:

• If $\mathrm{VAL}_{\mathrm{MXOR}}(J) \ge 1-\eta$, then

$$\Pr_{\text{coins of } \mathcal{A}}\left(\mathcal{A}(J) = \text{“non-random”}\right) \ge \frac{3}{4}$$

• If $J$ is random$^5$ then, with probability $1 - o_n(1)$ over the choice of $J$,

$$\Pr_{\text{coins of } \mathcal{A}}\left(\mathcal{A}(J) = \text{“random”}\right) \ge \frac{3}{4}.$$

$^4$To be precise, the $(q,K)$-tuples are chosen uniformly, and independently from one another.


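Lemma A.9 The problem $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{m}(\mathrm{MXOR}_{q,K})$ can be efficiently reduced to $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{m}(\mathrm{MXOR}_{q,K},\neg\mathrm{MXOR}_{q,K})$

Proof Given an instance $J = \{C_1,\ldots,C_m\}$ to $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{m}(\mathrm{MXOR}_{q,K})$ we will produce an instance $J'$ to $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{m}(\mathrm{MXOR}_{q,K},\neg\mathrm{MXOR}_{q,K})$ as follows. For each $C_j$, w.p. $\frac{1}{2}$ we will add to $J'$ the pair $(C_j, 1)$, and w.p. $\frac{1}{2}$ we will add the pair $(C', -1)$ where $C' = (C'^1,\ldots,C'^q)$ is obtained from $C_j = (C_j^1,\ldots,C_j^q)$ by flipping, for each $C_j^i$, the sign of the first literal. It is not hard to see that if $J$ is random then so is $J'$. Also, for every $\psi \in \{\pm 1\}^n$, $\mathrm{VAL}_{\psi,\mathrm{MXOR}}(J) = \mathrm{VAL}_{\psi,\mathrm{MXOR}}(J')$ and therefore, if $\mathrm{VAL}_{\mathrm{MXOR}}(J) \ge 1-\eta$ then $\mathrm{VAL}_{\mathrm{MXOR}}(J') \ge 1-\eta$ as well. □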
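
The reason the value is preserved is that flipping the first literal of each $K$-tuple negates every XOR, and hence (since $q$ is odd) negates the majority. The sketch below illustrates this under a hypothetical representation in which a $(q,K)$-tuple is a tuple of blocks of `(variable, sign)` literals; the function names are assumptions made for the example.

```python
import random


def xor(bits):
    """XOR on {-1, 1} values, written multiplicatively."""
    out = 1
    for b in bits:
        out *= b
    return out


def mxor_value(C, psi):
    """MXOR of the (q, K)-tuple C under psi; q (the number of blocks) must be odd."""
    votes = sum(xor([sign * psi[var] for var, sign in block]) for block in C)
    return 1 if votes > 0 else -1


def flip_first_literals(C):
    """Negate the first literal of every K-tuple inside the (q, K)-tuple C."""
    return tuple(
        ((block[0][0], -block[0][1]),) + tuple(block[1:])
        for block in C
    )


def add_labels(formula, rng):
    """Keep C with label 1 w.p. 1/2, otherwise flip first literals and use label -1."""
    labeled = []
    for C in formula:
        if rng.random() < 0.5:
            labeled.append((C, 1))
        else:
            labeled.append((flip_first_literals(C), -1))
    return labeled
```

A labeled pair produced from $C_j$ is satisfied by $\psi$ exactly when $\mathrm{MXOR}_{q,K}(C_j(\psi)) = 1$, so the fraction of satisfied constraints is unchanged.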

Step III: Enforcing pseudo-randomness 

We say that a labeled $(q,K)$-formula $J$ is $(t,\tau)$-pseudo-random if the $K$-formula consisting of all the $K$-tuples that appear in $J$ is $(t,\tau)$-pseudo-random. For $m = m(n)$, $(q,K) = (q(n),K(n))$, $(t,\tau) = (t(n),\tau(n))$ and $\frac{1}{2} > \eta = \eta(n) > 0$, we say that the problem $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{m,(t,\tau)}(\mathrm{MXOR}_{q,K},\neg\mathrm{MXOR}_{q,K})$ is easy, if there exists an efficient randomized algorithm $\mathcal{A}$ with the following properties. Its input is a labeled $(q,K)$-formula $J$ with $n$ variables and $m$ constraints that is $(t,\tau)$-pseudo-random. Its output satisfies:

• If $\mathrm{VAL}_{\mathrm{MXOR}}(J) \ge 1-\eta$, then

$$\Pr_{\text{coins of } \mathcal{A}}\left(\mathcal{A}(J) = \text{“non-random”}\right) \ge \frac{3}{4}$$

• If $J$ is random$^6$ then, with probability $1 - o_n(1)$ over the choice of $J$,

$$\Pr_{\text{coins of } \mathcal{A}}\left(\mathcal{A}(J) = \text{“random”}\right) \ge \frac{3}{4}.$$

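Lemma A.10 For $r = r(n) > 4$, the problem $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{n^r}(\mathrm{MXOR}_{q,K},\neg\mathrm{MXOR}_{q,K})$ can be efficiently reduced to $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{n^r,\left(r,\,n^{-r/4}\right)}(\mathrm{MXOR}_{q,K},\neg\mathrm{MXOR}_{q,K})$.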


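Proof Given an instance $J$ to $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{n^r}(\mathrm{MXOR}_{q,K},\neg\mathrm{MXOR}_{q,K})$ we will simply check whether it is $\left(r, n^{-r/4}\right)$-pseudo-random or not. If it is not, we will say that $J$ is not random. Otherwise, we will leave it as is, as an instance to $\mathrm{CSP}^{\mathrm{rand},1-\eta}_{n^r,\left(r,\,n^{-r/4}\right)}(\mathrm{MXOR}_{q,K},\neg\mathrm{MXOR}_{q,K})$. To see that this reduction works, note that

• If $\mathrm{VAL}_{\mathrm{MXOR}}(J) \ge 1-\eta$, we will either say that it is not random or produce an $\left(r, n^{-r/4}\right)$-pseudo-random instance with value $\ge 1-\eta$.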


$^5$To be precise, the $(q,K)$-tuples and the labels are chosen uniformly, and independently from one another.
$^6$To be precise, $J$ is chosen uniformly at random from all $(t,\tau)$-pseudo-random labeled $(q,K)$-formulas.
• If $J$ is random, by Lemma A.2, it is $\left(r, n^{-r/4}\right)$-pseudo-random with probability at least

1 — (2n)'^2exp 2n'’n“2^ = 1 — 2exp ^Ariog(2n) — 2n'’n“2^ > 1 — o„(l) . 

Hence, w.p. $\ge 1 - o_n(1)$ the reduction will produce an instance. Now, conditioning on this event, the produced formula is a random labeled $(q,K)$-formula that is $\left(r, n^{-r/4}\right)$-pseudo-random.

□ 


Step IV: From MXOR to polynomials

Let $\mathrm{POL}_{u,d}$ be the hypothesis class of all functions $h : \{0,-1,1\}^u \to \{\pm 1\}$ that are thresholds of degree $\le d$ polynomials. For $u = u(n)$, $d = d(n)$, $m = m(n)$ and $\eta = \eta(n)$ we consider the problem $\mathrm{POL}(u,d)^{\text{s-scat},\eta}_{m}$ of distinguishing a strongly-scattered sample from a sample with $\mathrm{Err}_{\mathrm{POL}_{u,d}}(S) \le \eta$. Concretely, the input is a sample $S = \{(x_1,y_1),\ldots,(x_m,y_m)\} \subset \{-1,1,0\}^u \times \{\pm 1\}$, and we say that the problem is easy if there exists an efficient randomized algorithm $\mathcal{A}$ with the following properties. Its input is such a sample $S$, and its output satisfies:

• If $\mathrm{Err}_{\mathrm{POL}_{u,d}}(S) \le \eta$, then

$$\Pr_{\text{coins of } \mathcal{A}}\left(\mathcal{A}(S) = \text{“almost-realizable”}\right) \ge \frac{3}{4}$$

• If $S$ is strongly scattered then, with probability $1 - o_n(1)$ over the choice of the labels,

$$\Pr_{\text{coins of } \mathcal{A}}\left(\mathcal{A}(S) = \text{“scattered”}\right) \ge \frac{3}{4}.$$


Lemma A.11 For d such that < d < t, ,-iMXORg^x) can effi¬ 

ciently reduced to FOL{nqK,d)X^^^^'^ where 


rj' = rj + 2q exp 



/I d 1 2*^ 

^2 2K' 2 d ) 



Proof Given a labeled $(q,K)$-formula $J = \{(C_1,y_1),\ldots,(C_m,y_m)\} \subset \mathcal{X}_{n,q,K} \times \{\pm 1\}$ we will simply produce the sample $S = \{(\pi(C_1),y_1),\ldots,(\pi(C_m),y_m)\} \subset \{-1,1,0\}^{nqK} \times \{\pm 1\}$. Here, $\pi : \mathcal{X}_{n,q,K} \to \{-1,1,0\}^{nqK}$ is the mapping $\pi(C^1,\ldots,C^q) = (\pi(C^1),\ldots,\pi(C^q))$, where $\pi : \mathcal{X}_{n,K} \to \{-1,1,0\}^{nK}$ is as defined in section A.2.

Clearly if $J$ is random then $S$ is scattered. It remains to show that if $\mathrm{VAL}_{\mathrm{MXOR}}(J) \ge 1-\eta$ and $J$ is $(t,\tau)$-pseudo-random then there is a degree $\le d$ polynomial that errs on $\le \eta'$ fraction of the examples.

Indeed, let V' G {=tl}" be an assignment that satisfies > 1 — y fraction of J’th (g, iL)-tuples. 
By lemma A. 6, there is a polynomial p : { — 1,1,0}“^ —)• M of degree < d that satishes p{'k{C)) = 
XOR(C'(?/))) on 1 — 2exp \ fraction of J’s iL-tuples (here, J’s K- 

tuples are the AT-tuples that appear in one of J’s (g, Ar)-tuples). 

Let p' : {—1,1,0}“'^^ —)■ M be the degree < d polynomialp'(x) = Yl'j=iPX{j-i)K+i, ■ ■ ■, ^jx)- R 

is not hard to check that sign {p'{' k{C))) = MXOR(C(V’)) on 1—2gexp ^5 + 2 X’ 2 R ~ —3^) -^) 
fraction of J’s {q, iL)-tuples. 

Therefore, we have that sign (p'( 7 r(C'j))) / yi for at most T/-|- 2 (/exp ^5 + ^, ^ + 

fraction of the samples in S. □ 
Step V: From polynomials to halfspaces

Let $u = u(n)$, $m = m(n)$ and $\eta = \eta(n)$. Similarly to $\mathrm{POL}(u,d)^{\text{s-scat},\eta}_{m}$, we define $\mathrm{HALF}(u)^{\text{s-scat},\eta}_{m}$ as the problem of distinguishing a strongly-scattered sample consisting of $m$ examples in $\{\pm 1\}^u \times \{\pm 1\}$, from a sample with $\mathrm{Err}_{\mathrm{HALF}}(S) \le \eta$.

Lemma A.12 The problem $\mathrm{POL}(u,d)^{\text{s-scat},\eta}_{m}$ can be efficiently reduced to $\mathrm{HALF}\left(2(u+1)^d\right)^{\text{s-scat},\eta}_{m}$

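Proof It will be convenient to decompose the reduction into two steps, where the second only deals with the issue of replacing $\{-1,1,0\}^{(u+1)^d}$ by $\{\pm 1\}^{2(u+1)^d}$. Given a sample

$$S = \{(x_1,y_1),\ldots,(x_m,y_m)\} \subset \{-1,1,0\}^u \times \{\pm 1\}$$

the reduction will first produce the sample

$$\rho(S) = \{(\rho(x_1),y_1),\ldots,(\rho(x_m),y_m)\} \subset \{-1,1,0\}^{(u+1)^d} \times \{\pm 1\},$$

where $\rho : \{-1,1,0\}^u \to \{-1,1,0\}^{(u+1)^d}$ is defined as follows. We index the coordinates in $\{-1,1,0\}^{(u+1)^d}$ by the functions in $([u] \cup \{*\})^{[d]}$ and we let

$$\forall f \in ([u] \cup \{*\})^{[d]},\quad \rho_f(x) = \prod_{i=1}^{d} x_{f(i)}$$

(where $x_* := 1$). It is not hard to see that the collection of degree $\le d$ polynomial functions from $\{-1,1,0\}^u$ to $\mathbb{R}$ equals

$$\left\{x \mapsto \langle w,\rho(x)\rangle \mid w \in \mathbb{R}^{(u+1)^d}\right\}.$$

Hence, $\mathrm{Err}_{\mathrm{POL}_{u,d}}(S) = \mathrm{Err}_{\mathrm{HALF}}(\rho(S))$.

In the second step the reduction will produce the sample

$$\Psi(\rho(S)) = \{(\Psi(\rho(x_1)),y_1),\ldots,(\Psi(\rho(x_m)),y_m)\} \subset \{\pm 1\}^{2(u+1)^d} \times \{\pm 1\}$$

where $\Psi : \{-1,1,0\}^{(u+1)^d} \to \{\pm 1\}^{2(u+1)^d}$ is defined as follows:

$$\Psi(x) = \left(\Psi(x_1),\ldots,\Psi(x_{(u+1)^d})\right),$$

where for $x \in \{0,-1,1\}$,

$$\Psi(x) = \begin{cases} (1,1) & x = 1 \\ (-1,-1) & x = -1 \\ (-1,1) & x = 0 \end{cases}$$

It is not hard to see that for every $w \in \mathbb{R}^{(u+1)^d}$ we have $h_w = h_{w'} \circ \Psi$ where $w' = \frac{1}{2}\left(w_1,w_1,\ldots,w_{(u+1)^d},w_{(u+1)^d}\right)$. Therefore we have

$$\mathrm{Err}_{\mathrm{HALF}}(\Psi(\rho(S))) \le \mathrm{Err}_{\mathrm{HALF}}(\rho(S)) = \mathrm{Err}_{\mathrm{POL}_{u,d}}(S)$$

Also, it is clear that if $S$ is strongly scattered then so is $\Psi(\rho(S))$. To summarize, the reduction $S \mapsto \Psi(\rho(S))$ forms a reduction from $\mathrm{POL}(u,d)^{\text{s-scat},\eta}_{m}$ to $\mathrm{HALF}\left(2(u+1)^d\right)^{\text{s-scat},\eta}_{m}$. □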
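
The first step of this reduction, the monomial map $\rho$, can be sketched as follows. This is an illustration under stated assumptions: coordinates are enumerated in the order given by `itertools.product`, the index `u` stands for the symbol $*$, and the helper `weight_vector` is a hypothetical convenience for placing monomial coefficients.

```python
from itertools import product


def rho(x, d):
    """Monomial features indexed by functions f: [d] -> [u] + {*}, with x_* := 1."""
    extended = list(x) + [1]  # the last slot plays the role of '*'
    features = []
    for f in product(range(len(extended)), repeat=d):
        value = 1
        for index in f:
            value *= extended[index]
        features.append(value)
    return features


def weight_vector(coeffs, u, d):
    """Place each coefficient at the coordinate of its index function f (a length-d tuple)."""
    index_of = {f: i for i, f in enumerate(product(range(u + 1), repeat=d))}
    w = [0] * len(index_of)
    for f, c in coeffs.items():
        w[index_of[f]] += c
    return w
```

For example, with $u = 3$ and $d = 2$, the polynomial $2x_0x_1 - x_2 + 3$ corresponds to `weight_vector({(0, 1): 2, (2, 3): -1, (3, 3): 3}, 3, 2)`, and its value at $x$ equals the inner product of that vector with `rho(x, 2)`.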


Connecting the dots 

We start with the hard problem CSP™^’^(XORa')- Using Lemma A.7 and Lemma A.9, we reduce 

it to CSP^“‘^’^ (MXOR^^a:, “'MXORq^A:)- Since y is bounded away from 1, this can be done with 
L g -I 

y = C • A for a constant C = C{p). We can reduce farther to (MXORq^AT, -ilVIXORg^x) 

(by simply throwing away random tuples from the input formula) 



Now, we use Lemma A.10 to reduce to (MXOR„ x, “'MXOR„ i^). Using 

l,n ^) 

s—scat,77' 


Lemma A.11, for every d such that ”, —^ < d < r we can reduce to POL(nC'iL^, 


for 


r]' <2 


-K 


+ 2CK exp ( -2 ( — - 


^d-l^d-r+l 


K 


We will choose $d$ such that $d = o(r)$ and $d = \omega\left(\sqrt{\log(K)K}\right)$ (the exact choice depends on the assumption we start with and will be specified later). It is not hard to check that for such a choice,

it holds that for large enough r and K, ’'+1 < d < r — 1 and r]' < for a constant 

Cl > 0. 

Now, by Lemma A. 12 we can reduce farther to HALF(n^'^)®7-T*’’^ • Putting n' = n^'^, we 

conclude that HALF(n')^~r)ii*’’^ . Since d = o{r), this can be reduced farther (by simply randomly 

n'^ar 

throwing examples from the input sample) to HALF(n')^7j^'^’^*'’^ for any constant a > 0. By Theorem 
2.1, we conclude that there is no efficient learning algorithm that can return a hypothesis with non-trivial error on a distribution $\mathcal{D}$ on $\{\pm 1\}^{n'} \times \{\pm 1\}$ that is $\eta'$-almost realizable by halfspaces.

For the choice of d and the calculation of r]' in terms of n', we split to two cases, according to 
the assumption we started with. 

Case 1 (Assumption 1.1). Here, r = clog{K)'/K and K is constant. We will choose 

d = log3[K)y/K. We will have ij' < 2 Since K can be arbitrarily large, rj' can be 

arbitrarily small. 

Case 2 (Assumption 1.2). Here, r = cK and K = log*(n). Here, we will choose d = 

Note that 


j< _ 

log log K ■ 


< 

2 ^ (log log if) 

— 

„ log«(n) 

2 (log log if) 

< 

2-log‘>-Rn) 

^(n) 

< log^’''^(n). 

s —1 

r]' < 


Since s can be arbitrarily large, we can get rj' < 2 


A.4 How to prove Theorems 1.4 and 1.5? 

We next briefly explain how our argument can be extended to prove Theorems 1.4 and 1.5. 

Theorem 1.4 is proved analogously to Theorem 1.3. The only difference is that we have to verify certain properties of the vector defining the halfspace that almost realizes the sample. Namely, we have to make sure that (i) the sum of its coefficients is polynomial in the dimension and (ii) that whenever we guarantee that its prediction is correct, its inner product with the instance is > 1 in absolute value. These two facts can be straightforwardly verified by carefully going over the proof.

Theorem 1.5 can also be proved analogously to Theorem 1.3. The only difference is that we have to verify that (i) the problem we start with cannot be solved efficiently using statistical queries and that (ii) the reduction steps can be done using statistical queries. This strategy can indeed be carried out, if one uses the result [39] of Feldman, Perkins and Vempala, which shows the SQ-hardness of the initial problem. However, a simpler strategy can be applied. Concretely, we can use our argument to reduce the uniform $K$-sparse-parity learning problem to the problem of learning halfspaces. In this parity problem, the learning algorithm is given access to examples $(x,h(x))$ where $x \in \{\pm 1\}^n$ is uniformly distributed and $h$ computes the XOR of $K$ unknown variables. It is known [22, 21] that no SQ-algorithm for the problem can return a classifier with
error < ^ using queries with error parameters Lemma A.5 shows that for 

any XOR of K variables, there is a polynomial threshold function of degree d that agree with h on 
all but 2exp ^^-fraction of the examples (w.r.t. the uniform distribution). As in the proof of 
Theorem 1.3, this fact establishes a reduction to the problem of agnostically learning halfspaces. 
All is left to show is that this reduction can be implemented using statistical queries. 

Indeed, the reduction is of the following form. It introduces a mapping $\Psi : \{\pm 1\}^n \to \{\pm 1\}^{n'}$ and reduces the original learning problem to the problem of learning $(\Psi(x), h(x))$, where $x$ is sampled from the original distribution (the uniform distribution in our case). Now, given a statistical query $Q : \{\pm 1\}^{n'} \times \{\pm 1\} \to \{\pm 1\}$, in order to evaluate $\mathbb{E}\left[Q(\Psi(x), h(x))\right]$, we can simply query the oracle of the original problem with the function $Q'(x,y) := Q(\Psi(x),y)$.
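
The query translation amounts to composing the query with the feature map. The sketch below is illustrative only: `exact_expectation` is a hypothetical stand-in that averages a query over a finite list of examples, not a model of a real SQ oracle with tolerance.

```python
from itertools import product


def lift_query(Q, Psi):
    """Turn a query Q on (Psi(x), y) into a query on the original pair (x, y)."""
    def Q_lifted(x, y):
        return Q(Psi(x), y)
    return Q_lifted


def exact_expectation(query, examples):
    """Average of a query over a finite list of labeled examples."""
    return sum(query(x, y) for x, y in examples) / len(examples)


def parity_examples(n, support):
    """All x in {-1, 1}^n, each labeled by the XOR (product) of the coordinates in `support`."""
    examples = []
    for x in product((-1, 1), repeat=n):
        label = 1
        for i in support:
            label *= x[i]
        examples.append((x, label))
    return examples
```

Answering the lifted query on the original examples gives the same number as answering $Q$ on the mapped examples, which is why no access beyond the original oracle is needed.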